Characters and patterns of communities in networks 
A community can be seen as a group of vertices with strong cohesion among themselves and weak cohesion
with the vertices outside the group. Community structure is one of the most remarkable features of many complex networks.
There are various kinds of algorithms for detecting communities. However, the following question remains wide open:
what can we do with the communities? In this paper, we propose some new notions to characterize and
analyze the communities. The new notions are general characters of the communities or local structures 
of networks. At first, we introduce the notions of internal dominating set and external dominating set 
of a community. We show that most communities in real networks have a small internal dominating set 
and a small external dominating set, and that the internal dominating set of a community keeps much of 
the information of the community. Secondly, based on the notions of the internal dominating set and the 
external dominating set, we define an internal slope (ISlope, for short) and an external slope (ESlope, for 
short) to measure the internal heterogeneity and external heterogeneity of a community respectively. We 
show that the internal slope (ISlope) of a community largely determines the structure of the community, 
that most communities in real networks are heterogeneous, meaning that most of the communities have a 
core/periphery structure, and that both ISlopes and ESlopes (reflecting the structure of communities) of all
the communities of a network approximately follow a normal distribution. Therefore typical values of both
ISlopes and ESlopes of all the communities of a given network are in a narrow interval, and there is only
a small number of communities having ISlopes or ESlopes out of the range of typical values of the ISlopes 
and ESlopes of the network. Finally, we show that all the communities of the real networks we studied, have 
a three degree separation phenomenon, that is, the average distance of communities is approximately 3, 
implying a general property of true communities for many real networks, and that good community finding 
algorithms find communities that amplify clustering coefficients of the networks, for many real networks. 

Categories and Subject Descriptors: H.2.8 [Database Management]: Database applications — Data mining 

General Terms: Measurement; Experimentation 

Additional Key Words and Phrases: community, internal dominating set, external dominating set, internal 
slope, external slope 

1. INTRODUCTION 

Real networks differ from random graphs in that they are organized with a high
level of order. Such an organization results in remarkable common phenomena of real networks,
for instance the heavy tail degree distributions, the high clustering coefficients and
the small average distances [Barabasi and Albert 1999; Watts and Strogatz 1998]. In
addition, another remarkable common feature in various networks is the community structure.
Community is an important notion to disclose the structure of networks, playing the
role of a bridge between the local vertices and the global network. On the one hand, we could extract
communities from a network to study its internal structure and its relationship with the rest
of the network from the local point of view. On the other hand, we could take each community
as a unit of the network, to illustrate the connecting patterns of different communities
of real networks through the distributions of different properties of communities from the
global point of view [De Nooy et al. 2011].

Massive work has been devoted to the study of communities, including the main definitions
of the community problem, the development of algorithms for finding communities, and comparisons
and tests of different algorithms [Fortunato 2010]. Leskovec et al. [Leskovec et al. 2009]
analyzed community structures in large real networks and tried to find the "best" communities
at various sizes. They showed that the "best" communities seem to be characterized
by a size of about 100. The distribution of sizes of communities has also been studied, showing that
in some cases it is skewed [Clauset et al. 2004; Newman 2004b]. The
small community phenomenon was introduced recently, that is, there are models, classical
or new, such that networks from the models are rich in small communities, that is, quality
communities of small sizes [Li and Peng 2011; Li and Peng 2012], for which the mechanism
is homophyly.

Intuitively speaking, a community of a network can be interpreted as a relatively indepen- 
dent and stable unit of the network, and the rich communities of a network are taken as the 
local structures of the network. This suggests fundamental questions such as: What can we 
do with the communities? Are there some characters of all the communities of a network? 
What information of the network can we extract from the communities? What characters of 
communities (largely) determine the local patterns of the network? What is the relationship
between the found communities and the true communities? These questions are widely
open in the current state of the art. This motivates the research in the present paper. For
this, we investigate the following: (1) How to extract central nodes from a community? (2) 
How to extract useful information from the communities? (3) How do communities interact 
with each other? (4) How to measure the heterogeneity of a community? (5) What general 
properties do the communities (found by a reasonably good algorithm) have? 

By using a variant of the local spectral partitioning algorithm [Andersen et al. 2006], we
find rich communities in real networks. These networks include collaboration networks, citation
networks, email networks and one benchmark network¹ [Girvan and Newman 2002;
Leskovec et al. 2007]. In the collaboration networks, a node denotes a scientist and an edge
indicates that the two scientists have coauthored a paper. In the citation networks, a node
denotes a paper in some field and an edge between two papers indicates that at least
one paper has cited the other. Communities in these networks may correspond to different
research groups or research themes. Two email networks are also used in our study,
in which each node corresponds to an email address and an edge between nodes i and j
represents i sending at least one message to j or j sending at least one message to i. A
well known benchmark network of American college football teams compiled by Girvan and
Newman [Girvan and Newman 2002] is also used. Nodes of the network represent teams
and an edge between two nodes represents that the corresponding two teams play against
each other. The network contains 12 true communities, which correspond to the 12 different
conferences that the teams belong to. All the networks above have good community structures,
so that they are good candidates for investigating the characters and connecting patterns
of local structures of networks.

We organize the paper as follows. In Section 2, we propose the notions of internal dominating
ratio and external dominating ratio to measure the importance of a subset of a
community. Then we give the definitions of internal dominating set (IDS) and external dominating
set (EDS). In Section 3, we verify that the internal dominating set of a community is



1 All the data in this paper can be found from the websites http://snap.stanford.edu or
http://www-personal.umich.edu/~mejn/netdata, and we only consider the corresponding undirected
graphs.
much smaller than the community and largely preserves the information of the community.
In Section 4, we define the internal slope (ISlope) and external slope (ESlope) of a community
to measure the internal heterogeneity and the external heterogeneity of the community,
respectively. We analyze the relationship between the structure and the ISlopes and give
the distributions of the ISlopes and the ESlopes of all the communities of the real networks.
In Section 5, we analyze more general properties like average distances, diameters and
clustering coefficients of all the communities for each of the networks. Finally, in Section 6,
we summarize the conclusions of the paper.

2. INTERNAL AND EXTERNAL DOMINATING SETS 

Table I. Statistics of real networks. All the results are calculated by
averaging the corresponding properties of all the communities. The IDR
and EDR are the ratios of centrality of 5-IDS and 5-EDS; the IDN and
EDN are the sizes of 0.8-IDS and 0.8-EDS.

Network        IDR    EDR    IDN    EDN    ISlope  ESlope
football       0.99   0.61   2.6    9.3    0.19    0.37
cit_hepth      0.75   0.49   10     32     0.41    0.54
cit_hepph      0.73   0.39   12     56     0.5     0.54
col_astroph    0.93   0.79   3.7    8.1    0.36    0.65
col_condmat    0.85   0.79   9.6    16     0.42    0.66
col_grqc       0.94   0.91   3.1    3.9    0.37    0.67
col_hepth      0.69   0.64   23     27     0.38    0.64
col_hepph      0.8    0.7    11     16     0.38    0.64
email_enron    0.93   0.86   3      7.8    0.55    0.68
email_euall    0.98   0.95   1.7    2.4    0.92    0.89




Given a community of a network, we may want to extract a small set of nodes that are 
more central to the community than the rest of nodes in the community. Taking the citation 
network for an example, we are interested in a small number, 10 say, of important papers 
that are central to the whole community which usually includes hundreds of papers. In this 
case, we would hope that with the short list of key papers, we will not lose any essential 
information of the whole community. This kind of centrality has been studied for
whole networks; for example, it was shown that a small fraction of nodes accumulates a
large proportion of links in the networks [Newman 2004a], and that only 20% of the most-linked
authors in Economics account for about 60% of all the links [Goyal et al. 2006]. So
there are indeed some nodes taking the central position in networks. We believe that similar
centrality phenomena occur in true communities of many real networks, and that the main
goal of community finding algorithms is to find the true communities of the networks.
The question is: what can we say about the centrality of the communities found by our 
algorithms? This would be the first step to understand the relationship between the true 
communities and the communities found by algorithms. 

Some centrality measures, initially introduced in social studies, could be used, for instance
the degree centrality, the closeness centrality, and the betweenness centrality
[Freeman 1979]. These measures assume a relationship between the structural position and
influential power in group processes [Bavelas 1948], and are developed and widely used in
the literature [Nicosia et al. 2012]. The mechanism behind this idea is that the centrality
of a vertex could be predicted from its position and the network structure in which it is
embedded as well as from its own characteristics [Rogers 1974]. Apart from these centrality
measures, vertices could also be classified according to their roles within their communities.
Guimera and Amaral determine the role of a vertex by a within-module degree z_i and a
participation ratio P_i, and distinguish seven roles that vertices can play, based on the values of
the pair (z, P) [Guimera and Amaral 2005].
Fig. 1. A community to illustrate the IDR and EDR. The red nodes are from one community C. Let
$S = \{v_3, v_4\}$ be a subset of this community; then $N(S, C) = \{v_0, v_2, v_5\}$,
$N(S, \bar{C}) = \{v_6, v_8\}$ and $N(C, \bar{C}) = \{v_6, v_7, v_8\}$,
so IDR(S) = (2 + 3)/6 = 5/6 and EDR(S) = 2/3.

In this section, we propose the notions of internal and external dominating sets of a
community by modifying the notion of the dominating set. The dominating set problem
is classical in graph algorithms: Given a graph G = (V, E), we say that a set S ⊆ V is a
dominating set if every node v ∈ V is either an element of S or adjacent to an element
of S. The dominating number is the number of vertices in a smallest dominating set for
G [Allan and Laskar 1978; Haynes et al. 1998].



For a community, we distinguish two roles that nodes can play in a community, as an 
internal role and an external role, measured by links within and outside of the community 
respectively. For a subset of a given community, its internal dominating ratio (IDR, for 
short) is defined as follows. 

Let C be a community, let S be a subset of C, and let N(S, C) be the set of neighbors of S within
community C. Then we define the internal dominating ratio of S in C, written IDR(S), as
follows:

$$\mathrm{IDR}(S) = \frac{|S \cup N(S,C)|}{|C \cup N(C,C)|} = \frac{|S \cup N(S,C)|}{|C|} \tag{1}$$

The dominating ratio has been used previously to measure the social centrality in social
networks [Freeman 1979]. Our internal dominating ratio (IDR) measures the importance of
a group of nodes in a community, and thus it can be seen as a general form of degree
centrality for communities.

Following the definition above, we consider two problems: 1) given a (usually small) number k,
we want to find a subset S of size k with the maximum IDR, in which case we call
this subset a k-IDS; 2) given a real number p in [0, 1], we want to find a subset S
with the minimum number of nodes whose IDR is at least p, in which case we call this
subset a p-IDS.

Similarly to IDR, we define the external dominating ratio (EDR). Let C be a
community, let S be a subset of C, let $\bar{C}$ denote the set of nodes outside of C, and let
$N(S, \bar{C})$ be the set of neighbors of S that are outside of C.
Then the external dominating ratio (EDR) of S in C is defined as follows:

$$\mathrm{EDR}(S) = \frac{|N(S,\bar{C})|}{|N(C,\bar{C})|} \tag{2}$$
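The two ratios can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the graph is stored as a dictionary of adjacency sets, the helper names (`make_graph`, `idr`, `edr`) are hypothetical, and the edge list is invented so that the counts reproduce the ratios 5/6 and 2/3 quoted in the Figure 1 caption; it is not the actual edge set of that figure.

```python
from typing import Dict, Hashable, Iterable, List, Set, Tuple

Graph = Dict[Hashable, Set[Hashable]]  # undirected graph as adjacency sets (assumed representation)


def make_graph(edges: List[Tuple[Hashable, Hashable]]) -> Graph:
    """Build symmetric adjacency sets from an undirected edge list."""
    graph: Graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def idr(graph: Graph, community: Set[Hashable], subset: Iterable[Hashable]) -> float:
    """Internal dominating ratio: |S U N(S, C)| / |C|, following Eq. (1)."""
    covered = set(subset)
    for v in list(covered):
        covered |= graph[v] & community  # neighbors of S that lie inside C
    return len(covered) / len(community)


def edr(graph: Graph, community: Set[Hashable], subset: Iterable[Hashable]) -> float:
    """External dominating ratio: |N(S, C-bar)| / |N(C, C-bar)|, following Eq. (2)."""
    boundary: Set[Hashable] = set()
    for v in community:
        boundary |= graph[v] - community  # all outside neighbors of C
    if not boundary:
        return 0.0  # convention chosen here for a community without outside neighbors
    reached: Set[Hashable] = set()
    for v in subset:
        reached |= graph[v] - community  # outside neighbors of S
    return len(reached) / len(boundary)


# Hypothetical edges: nodes 0..5 form the community, nodes 6..8 lie outside.
edges = [(3, 0), (3, 2), (3, 4), (4, 5), (1, 0), (4, 6), (3, 8), (5, 7)]
graph = make_graph(edges)
community = {0, 1, 2, 3, 4, 5}
subset = {3, 4}

print(idr(graph, community, subset))  # 5/6: S and three inside neighbors out of six members
print(edr(graph, community, subset))  # 2/3: two of the three outside neighbors of C
```

Both functions take the community explicitly, so the same subset can be scored against different candidate communities.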

We also define the notations k-EDS and p-EDS similarly. Figure 1 gives an example of the
IDR and EDR. From the definitions, we notice that we are not using the notion of the classic
ALGORITHM 1: Finding p-IDS

Input: Graph G, community C, and a real number p ∈ [0, 1].
Output: The p-internal dominating set S.

Let G_C be the subgraph of G induced by the vertices of C. Set S ← ∅;
repeat
    Let v be a node in C \ S such that v has the maximal number of neighbors in C \ (S ∪ N(S)),
    where N(S) is the set of neighbors of S in G_C;
    S ← S ∪ {v};
until IDR(S) ≥ p;



ALGORITHM 2: Finding p-EDS

Input: Graph G, community C, and a real number p ∈ [0, 1].
Output: The p-external dominating set S.

Set S ← ∅ and $N(S, \bar{C})$ ← ∅, where $N(S, \bar{C})$ is the set of neighbors of S outside of C;
repeat
    Let v be a node in C \ S such that v has the maximal number of neighbors in
    $N(C, \bar{C}) \setminus N(S, \bar{C})$;
    S ← S ∪ {v};
    $N(S, \bar{C})$ ← $N(S, \bar{C}) \cup N(v, \bar{C})$;
until EDR(S) ≥ p;



dominating set [Allan and Laskar 1978; Haynes et al. 1998]; instead, we introduce two parameters
k and p to define a general form of dominating sets. We emphasize that the
classification is based on the positions of nodes in a community. By definition, it is conceivable
that nodes in the IDS are more important for the function and stability of the community,
and that nodes in the EDS mainly take charge of the communication between the
community and the nodes outside of the community.

The dominating set problem is an NP-complete decision problem [Haynes et al. 1998]. Here
we use simple greedy algorithms (Algorithms 1 and 2) to find the p-IDS and p-EDS, where G is a graph,
C is a community and p is a real number in [0, 1].

Given a number p between 0 and 1, we can find the p-IDS and p-EDS by using the
above algorithms. Similarly, when given a small number k, we can calculate the k-IDS
and k-EDS by using the same algorithms with a slight modification of the terminating
condition. In our experiment, we set k = 5 when calculating the k-IDS and the k-EDS,
and set p = 0.8 when calculating the p-IDS and p-EDS; see Table I for details.
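The greedy procedures and their k-variant can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the adjacency-set representation, the function names and the toy graph are assumptions. It also counts a candidate node itself as newly covered when it is still uncovered, a small deviation from Algorithm 1 made here so that every step makes progress, and it breaks ties by a fixed ordering.

```python
from typing import Dict, Hashable, List, Optional, Set

Graph = Dict[Hashable, Set[Hashable]]  # undirected adjacency sets (assumed representation)


def greedy_ids(graph: Graph, community: Set[Hashable], p: float = 1.0,
               k: Optional[int] = None) -> List[Hashable]:
    """Greedy internal dominating set: stop once IDR >= p or, if k is given, once |S| = k."""
    community = set(community)
    chosen: List[Hashable] = []
    covered: Set[Hashable] = set()  # S together with its neighbors inside C
    while len(chosen) < len(community):
        if len(covered) / len(community) >= p:
            break
        if k is not None and len(chosen) >= k:
            break
        candidates = sorted(community - set(chosen), key=repr)  # fixed order for ties
        best = max(candidates,
                   key=lambda v: len(((graph[v] & community) | {v}) - covered))
        chosen.append(best)
        covered |= (graph[best] & community) | {best}
    return chosen


def greedy_eds(graph: Graph, community: Set[Hashable], p: float = 1.0,
               k: Optional[int] = None) -> List[Hashable]:
    """Greedy external dominating set: stop once EDR >= p or, if k is given, once |S| = k."""
    community = set(community)
    boundary: Set[Hashable] = set()
    for v in community:
        boundary |= graph[v] - community  # outside neighbors of C
    chosen: List[Hashable] = []
    reached: Set[Hashable] = set()  # outside neighbors of S
    while boundary and len(chosen) < len(community):
        if len(reached) / len(boundary) >= p:
            break
        if k is not None and len(chosen) >= k:
            break
        candidates = sorted(community - set(chosen), key=repr)
        best = max(candidates, key=lambda v: len(graph[v] - community - reached))
        chosen.append(best)
        reached |= graph[best] - community
    return chosen


# Toy graph (hypothetical): node 0 is an internal hub, node 2 reaches both outside nodes.
toy = {0: {1, 2, 3, 4}, 1: {0, 5}, 2: {0, 5, 6}, 3: {0}, 4: {0}, 5: {1, 2}, 6: {2}}
members = {0, 1, 2, 3, 4}

print(greedy_ids(toy, members, p=0.8))       # [0]
print(greedy_eds(toy, members, p=0.8))       # [2]
print(greedy_eds(toy, members, p=1.0, k=1))  # [2]
```

Passing `k` with `p=1.0` gives the k-variant: the loop then ends after k picks, or earlier if the whole community or boundary is already dominated.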

From Table I we observe the following: only five nodes can dominate most of the members of the
communities from both the internal and the external side; the internal dominating ratio of the 5
internally central nodes is larger than the external dominating ratio of the 5 externally central
nodes, for each of the networks; the external connecting patterns of the communities are
more decentralized than the internal connecting patterns, for each of the networks;
in most of the networks, about 10 nodes or fewer suffice on average to internally dominate at
least 80% of the whole community (the largest average in Table I is 23); in most of the networks,
at most 32 nodes suffice on average to externally dominate 80% of the outside neighbors of the
communities (the largest average in Table I is 56); and the external dominating numbers are
larger than the internal dominating numbers, on average, for all the networks.

In summary, we have that most communities have a small internal dominating set, and a 
small external dominating set, which is slightly larger than the internal dominating set of 
the corresponding communities, on the average, for all the networks. 
ALGORITHM 3: Predicting keywords using internal dominating set

Input: Graph G, community C, and keyword dictionary Dic
Output: Papers with predicted keywords
Calculate the p-IDS or k-IDS of C;

Suppose that L = {k_1, k_2, ..., k_i} are the listed keywords from the IDS, in descending order
of their popularity in C. For a given paper P in C whose keywords are not listed in the
network, for each j ≤ i, if k_j appears in either the title or the abstract of paper P, we say that k_j
is a predicted and confirmed keyword of P;



3. EXTRACTING LOCAL INFORMATION 

In the last section, we verified that most communities have a small internal dominating
set and a small external dominating set. The questions are: How much information of a
community is preserved in the dominating set of the community? How can essential
information of a community be extracted from the small dominating sets?

In this section, we verify that the internal dominating sets (IDSs) indeed preserve essential 
information of the communities. We verify this result by predicting and confirming keywords 
of papers in a citation network. 

We say that a paper has keywords if its authors have explicitly listed its keywords, and
that it does not have keywords otherwise.

Keywords of papers play an important role in information retrieval. In many citation
networks, there is a huge number of papers whose keywords are not listed by their authors,
which is an obstacle to making full use of the networks².

In the citation-hepth network, there are about 27,770 papers, of which only 10% or so
have keywords. Predicting and confirming the missing keywords for the other papers is
obviously significant for information retrieval.

Given a community C in a citation network, we predict and confirm keywords for papers 
in C by the following procedure. 
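The procedure (Algorithm 3) can be sketched as below. This is a hedged illustration: the function name, the data layout and the toy papers are hypothetical, and "popularity in C" is assumed here to mean the number of papers in C that list the keyword, which the text does not spell out. Matching is done by a simple case-insensitive substring test.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set


def predict_keywords(community: Iterable[str], ids: Iterable[str],
                     listed: Dict[str, Set[str]], text: Dict[str, str],
                     length: int = 5) -> Dict[str, List[str]]:
    """Predict and confirm keywords for papers in C that list none.

    listed maps a paper to its author-listed keywords (only some papers appear);
    text maps a paper to its title plus abstract; length is the size of the list L.
    """
    community = list(community)
    # Candidate keywords are those listed by papers in the internal dominating set.
    candidates: Set[str] = set()
    for paper in ids:
        candidates |= listed.get(paper, set())
    # Popularity in C (assumption: number of papers in C that list the keyword).
    popularity = Counter(kw for paper in community
                         for kw in listed.get(paper, set()) if kw in candidates)
    ranked = sorted(popularity, key=lambda kw: (-popularity[kw], kw))[:length]
    predictions: Dict[str, List[str]] = {}
    for paper in community:
        if paper in listed:
            continue  # this paper already has keywords
        body = text.get(paper, "").lower()
        hits = [kw for kw in ranked if kw.lower() in body]
        if hits:
            predictions[paper] = hits  # predicted and confirmed keywords
    return predictions


# Hypothetical toy data, not taken from the hepth network.
papers = ["p1", "p2", "p3", "p4"]
ids_papers = ["p1"]
listed_keywords = {"p1": {"string theory", "black hole"}, "p2": {"black hole"}}
paper_text = {"p3": "Entropy of a black hole in string theory",
              "p4": "Lattice gauge simulations"}

print(predict_keywords(papers, ids_papers, listed_keywords, paper_text, length=5))
# {'p3': ['black hole', 'string theory']}
```

A keyword is reported only when it also occurs in the paper's own title or abstract, which is the "confirmed" part of the procedure.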

We choose parameter p = 0.8, run the algorithm on the citation network, and report the
results in Table II. The first column of Table II presents the number of keywords used
for the prediction and confirmation in each community, that is, the length i of L in the
algorithm; the second column gives the number of papers whose keywords have been
predicted and confirmed for the corresponding length of L.

From Table II, taking the first row for example, we see that if we use the
5 most popular keywords appearing in the IDS of each of the communities, then there are
13,283 papers in the network whose keywords are predicted and confirmed. As the number
of keywords used in the algorithm, i.e., the length of L, becomes larger,
we can predict and confirm keywords for more papers, up to 14,691 papers. The
results show that the IDS is much smaller than the corresponding community and that the
IDS preserves much of the information of the corresponding community. From the experiment,
it is conceivable that in practical applications it is sound to recommend the IDS of a
community instead of the whole community, which is usually much larger. The result above
is unexpectedly good. We believe that this property may hold for many networks other
than citation networks, that is, the internal dominating set of a community keeps essential
information of the community. More importantly, the essential information of the internal
dominating set of a community can be easily extracted.



2 We implement the verification for just one citation network, because this is the only available network in
which titles, abstracts of papers, and keywords of a small number of papers are included. Most networks
have a topological structure with nodes and edges only.
Table II. Using 0.8-IDS to predict keywords in citation
network hepth

Keyword Number    Predicted Paper Number
5                 13283
10                13906
15                14375
20                14592
25                14641
30                14647
35                14654
40                14691
45                14691
50                14691




4. INTERNAL AND EXTERNAL SLOPES 

In the last section, we showed that most communities have a small IDS and a small EDS, and
that the small IDS of a community preserves much information of the community.

In this section, we will show that the IDS and EDS of a community usually take the central
positions in the community with low degree nodes around them, so that the community
forms a core/periphery structure.

Intuitively speaking, if all nodes in a community have equal position, e.g., in a regular
graph or a random graph, then they are homogeneous; if nodes in a community form a
core/periphery structure, e.g., in star-like graphs, then they are heterogeneous. Our main
question is: How do the IDS and EDS of a community reflect the homogeneity or the
heterogeneity of the community?

Before answering this question, we look at the power law distribution. It was shown that
the degree distributions of most networks follow a power law [Barabasi and Albert 1999], meaning that
the number of nodes of degree k is proportional to $k^{-\beta}$. The power
exponent $\beta$, which typically lies in the range $2 < \beta < 3$, measures the heterogeneity
of a network. However, it is nontrivial to estimate the exponent $\beta$, especially for small
networks, and not all networks follow the power law distribution [Clauset et al. 2009]. Most
communities are small; although they may have heavy tail degree distributions, it is not clear
whether they have power law distributions. More seriously, even if the communities have
power law distributions, fluctuations caused by the small sizes of communities may make
the result inaccurate, and since the number of communities is large, it is hard to characterize
the power law distributions of all the communities. Therefore the power exponent $\beta$ is
not suitable for measuring the heterogeneity of all the communities of a network. Another
approach is to use the relationship between the size of the dominating set and the degree
distribution. In fact, it was shown that the more heterogeneous the degree distribution
of a network is, the smaller the dominating set is [Nacher and Akutsu 2012].
This suggests that the internal and external dominating sets are closely related to the
heterogeneity of the communities.

We now measure the heterogeneity of communities by the internal and external dominating
sets of communities. See Figure 2(a), in which the community is homogeneous.
All members of the community have equal position, and any single node can dominate
the whole community. A single node also dominates a star graph, so from the dominating
number alone we cannot tell how heterogeneous the community is.
So the dominating set itself is insufficient to measure the homogeneity
and heterogeneity of a community. To solve this problem, we use the internal dominating
ratio (IDR) of the internal dominating set (IDS), together with the expected internal
dominating ratio (IDR) of a random selection of nodes of the same size as the IDS.

We define the internal slope (ISlope, for short) and external slope (ESlope, for short) 
of a community to measure the internal and external heterogeneity (or the core/periphery 
structure) of the community. Intuitively, the ISlope of a community is to measure the 
(c) ISlope = 0.72   (d) ISlope very near 1

Fig. 2. Real communities to illustrate ISlope. All of them except (d) are from the collaboration grqc network,
and (d) is from the email_enron network. In each figure, red nodes come from the same community.

distance between the community and regular graphs or star-like graphs, and the ESlope
of a community is to illustrate whether the community is connected with the rest of the
network evenly or through a small number of nodes like a funnel.

Let C be a community and let p ∈ [0, 1] be a real number. Suppose that S is the
p-IDS of C, that K is the size of S, and that $\mathcal{V} = \{V_1, V_2, \dots, V_m\}$ is the set of all subsets
of C of size K. Then we define the internal slope of C, written ISlope(C), as follows:

$$\mathrm{ISlope}(C) = \mathrm{IDR}(S) - \frac{1}{m}\sum_{X \in \mathcal{V}} \mathrm{IDR}(X) \tag{3}$$

The ISlope of a community represents the difference between the internal dominating ratio
of the most central nodes and the expected internal dominating ratio of a random choice
of nodes of the same size. It measures the homogeneity and heterogeneity (core/periphery
structure) of the community from the internal point of view. We show some communities
of real networks found by our algorithm in Figure 2. From these figures we can observe
that the ISlopes and ESlopes of the communities largely reflect the homogeneity and the
heterogeneity of the corresponding communities.
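A direct way to evaluate the ISlope is sketched below. It is an illustration under several assumptions: the p-IDS is approximated by a greedy selection that counts the candidate node itself as covered, the mean over all K-subsets is computed exactly only when their number is small and is otherwise estimated from random samples (a substitution introduced here, not described in the text), and all names and the two test graphs are hypothetical.

```python
import random
from itertools import combinations
from math import comb
from typing import Dict, Hashable, Iterable, List, Set

Graph = Dict[Hashable, Set[Hashable]]  # undirected adjacency sets (assumed representation)


def idr(graph: Graph, community: Set[Hashable], subset: Iterable[Hashable]) -> float:
    """Internal dominating ratio |S U N(S, C)| / |C|."""
    covered = set(subset)
    for v in list(covered):
        covered |= graph[v] & community
    return len(covered) / len(community)


def greedy_p_ids(graph: Graph, community: Set[Hashable], p: float) -> List[Hashable]:
    """Greedy approximation of the p-IDS (ties broken by a fixed ordering)."""
    chosen: List[Hashable] = []
    covered: Set[Hashable] = set()
    while len(chosen) < len(community) and len(covered) / len(community) < p:
        candidates = sorted(community - set(chosen), key=repr)
        best = max(candidates,
                   key=lambda v: len(((graph[v] & community) | {v}) - covered))
        chosen.append(best)
        covered |= (graph[best] & community) | {best}
    return chosen


def islope(graph: Graph, community: Iterable[Hashable], p: float = 0.8,
           max_exact: int = 20000, samples: int = 2000, seed: int = 0) -> float:
    """IDR of the (greedy) p-IDS minus the mean IDR over subsets of the same size."""
    community = set(community)
    central = greedy_p_ids(graph, community, p)
    size = len(central)
    nodes = sorted(community, key=repr)
    if comb(len(nodes), size) <= max_exact:
        subsets = list(combinations(nodes, size))  # exact mean over all K-subsets
    else:
        rng = random.Random(seed)  # sampled estimate of the mean
        subsets = [rng.sample(nodes, size) for _ in range(samples)]
    mean_idr = sum(idr(graph, community, x) for x in subsets) / len(subsets)
    return idr(graph, community, central) - mean_idr


# Two idealized extremes (hypothetical inputs): a 10-node star and a 5-node complete graph.
star = {0: set(range(1, 10)), **{i: {0} for i in range(1, 10)}}
complete = {i: set(range(5)) - {i} for i in range(5)}

print(round(islope(star, star.keys()), 2))          # 0.72
print(round(islope(complete, complete.keys()), 2))  # 0.0
```

The exact branch enumerates every K-subset, which is only practical for small communities or small K; the `max_exact` threshold is an arbitrary choice made for this sketch.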

By observing Figure 2, we know that the structures of communities are closely related to
the corresponding ISlopes of the communities. In particular, in Figure 2(a) all nodes have
equal position and a single node could dominate the whole community; in Figure 2(b) there
are some central nodes with periphery nodes around; in Figure 2(c) the central position
(a) ESlope =   (b) ESlope = 0.31   (c) ESlope = 0.52   (d) ESlope = 0.97

Fig. 3. Real communities to illustrate ESlope. All of them are from the collaboration grqc network. In each
figure, red nodes come from the same community.



of one node is more obvious, and the structure is a star-like graph; in Figure 2(d), the
community is a star graph with a hub in the center, and the ISlope of the community is
very near 1. Notice that a star graph is the most heterogeneous community, in which the
hub in its center is the most important node. In summary, we observe that the smaller the
ISlope of a community is, the more homogeneous the community is; conversely,
the larger the ISlope is, the more heterogeneous the community is; and
the ISlope of a community roughly reflects the pattern or structure of the community.
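The two extremes in this summary can be checked by hand. The derivation below is a worked example on idealized graphs with n nodes (n at least 3 for the star), not on communities from the data. It assumes that the ISlope is the IDR of the p-IDS minus the mean IDR over all subsets of the same size, and that the hub is taken as the p-IDS of the star, as a greedy selection would do.

```latex
% Star on n nodes (one hub, n-1 leaves): the hub alone dominates C, so K = 1.
\begin{align*}
\mathrm{IDR}(\{\text{hub}\}) &= 1, \qquad \mathrm{IDR}(\{\text{leaf}\}) = \frac{2}{n}, \\
\mathrm{ISlope}(\text{star}) &= 1 - \frac{1}{n}\Bigl(1 + (n-1)\,\frac{2}{n}\Bigr)
  = \frac{(n-1)(n-2)}{n^{2}} \;\longrightarrow\; 1 \quad (n \to \infty).
\end{align*}
% Complete graph on n nodes: every single node dominates C, so K = 1 and all IDRs equal 1.
\begin{align*}
\mathrm{ISlope}(K_n) &= 1 - \frac{1}{n}\sum_{v \in C} 1 = 0.
\end{align*}
```

For a star with n = 10 the formula gives 72/100 = 0.72, and the value approaches 1 as the star grows, while a complete graph stays at 0 for every n.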

Similarly to ISlope, we define the external slope of a community (ESlope) to measure the
external heterogeneity of the community. By using the ESlope of a community, we are able
to examine the pattern in which nodes in a community connect to nodes outside of the community,
that is, whether the nodes in a community connect to the rest of the network through a small
number of representatives or evenly through most members.

It has been shown that in a collaboration network, most people in the network (theme,
or topic) contact people in the network through just one or two of their best-connected
collaborators [Newman 2004a; Newman et al. 2001].

Our results show that such a funneling pattern of connections from a community to
the outside of the community is very common among the communities of a network, for a wide
range of real networks.
Let C be a community and let p ∈ [0, 1] be a real number. Suppose that S is a p-EDS
of C, that K is the size of S, and that $\mathcal{V} = \{V_1, V_2, \dots, V_m\}$ is the set of all subsets of C of size K. Then we define the
external slope of C, written ESlope(C), as follows:

$$\mathrm{ESlope}(C) = \mathrm{EDR}(S) - \frac{1}{m}\sum_{Y \in \mathcal{V}} \mathrm{EDR}(Y) \tag{4}$$

The ESlope of a community represents the difference between the external dominating
ratio of the most central nodes and the expected external dominating ratio of a random
selection of nodes of the same size.
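The external slope can be sketched in the same spirit. The code below is an illustrative assumption-laden version: the p-EDS is found greedily, the mean is taken exactly over all subsets of the same size (so it suits only small communities), and the function names and the two toy graphs, a "funnel" and an "even" community, are hypothetical.

```python
from itertools import combinations
from typing import Dict, Hashable, Iterable, List, Set

Graph = Dict[Hashable, Set[Hashable]]  # undirected adjacency sets (assumed representation)


def outside(graph: Graph, community: Set[Hashable], nodes: Iterable[Hashable]) -> Set[Hashable]:
    """Neighbors of the given nodes that lie outside the community."""
    result: Set[Hashable] = set()
    for v in nodes:
        result |= graph[v] - community
    return result


def greedy_p_eds(graph: Graph, community: Set[Hashable], p: float) -> List[Hashable]:
    """Greedy approximation of the p-EDS (ties broken by a fixed ordering)."""
    boundary = outside(graph, community, community)
    chosen: List[Hashable] = []
    reached: Set[Hashable] = set()
    while boundary and len(chosen) < len(community) and len(reached) / len(boundary) < p:
        candidates = sorted(community - set(chosen), key=repr)
        best = max(candidates, key=lambda v: len(graph[v] - community - reached))
        chosen.append(best)
        reached |= graph[best] - community
    return chosen


def eslope(graph: Graph, community: Iterable[Hashable], p: float = 0.8) -> float:
    """EDR of the (greedy) p-EDS minus the exact mean EDR over subsets of the same size."""
    community = set(community)
    boundary = outside(graph, community, community)
    if not boundary:
        return 0.0  # convention chosen here: no outside neighbors, no external slope
    central = greedy_p_eds(graph, community, p)
    subsets = list(combinations(sorted(community, key=repr), len(central)))
    mean_edr = sum(len(outside(graph, community, x)) for x in subsets) / (len(subsets) * len(boundary))
    return len(outside(graph, community, central)) / len(boundary) - mean_edr


# Funnel (hypothetical): only node 0 of the community {0,...,4} has outside neighbors.
funnel = {0: {1, 2, 3, 4, 5, 6}, 1: {0}, 2: {0}, 3: {0}, 4: {0}, 5: {0}, 6: {0}}
# Even (hypothetical): each member of the cycle {0,...,3} has its own outside neighbor.
even = {0: {1, 3, 10}, 1: {0, 2, 11}, 2: {1, 3, 12}, 3: {2, 0, 13},
        10: {0}, 11: {1}, 12: {2}, 13: {3}}

print(round(eslope(funnel, {0, 1, 2, 3, 4}), 2))  # 0.8
print(round(eslope(even, {0, 1, 2, 3}), 2))       # 0.0
```

In the funnel example a single bridge node carries all outside links, so a random member almost never reaches the outside; in the even example every member contributes equally and the slope vanishes.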

Figure 3 illustrates different connecting patterns of communities with different ESlopes.
In these figures, we also keep the neighbors and the neighbors of neighbors of the community
to highlight their connecting patterns. In Figure 3(a), all members have equal position to
connect with nodes outside of the community. In Figure 3(b), some nodes only have internal links, while
others have both external and internal links. In Figure 3(c), some nodes play the role
of bridges linking nodes inside and outside of the community. Finally, Figure 3(d)
shows a community in which only one node is the bridge; all other members communicate
with the outside world through this node. The ESlope indeed identifies different connecting
patterns of how communities connect with each other.

Table I gives the average ISlopes and ESlopes of all the communities of various networks.
Except for the football and the email_euall networks, all the networks have similar ISlopes and
ESlopes, with ESlopes larger than ISlopes on average. The ISlope and ESlope of a community
quantify the core/periphery structure of the community. Our results indicate that such
structures are universal in real networks and that real networks tend to avoid communities
that are either regular or star-like graphs and have structures with ISlopes and ESlopes in some
fixed interval, that is, the ISlopes are roughly in [0.35, 0.55] and the ESlopes in [0.5, 0.7].

These results raise the question of why networks tend to have such structures. We try to
explain this as follows: For a community, it is possible that some key nodes are essential
to its formation and evolution. On the one hand, it is unusual to have a community with all
members having equal position for a long period of time. On the other hand, the key nodes of
a community should not be too strong or too weak since otherwise the community structure
may be fragile. It is intuitive that if the central nodes of a community break down, then the
community structure would not exist any more. Therefore both very large and very
small ISlopes or ESlopes are unfavorable to the evolution of communities. The structures
of typical communities of a real network may be a compromise between the effectiveness and
robustness of the communities. We conjecture that the ESlopes may largely determine the
evolution of communities, which needs to be further investigated (in our ongoing project).

Besides the average values, we also report the distributions of the ISlopes and ESlopes
of all the communities of the real networks in Figure 4 and Figure 5. Figure 6 and Figure 7 show
the corresponding cumulative distributions. By observing these figures, we know that:

— Most communities have a core/periphery structure, with a small core in central positions 
and some low degree nodes in the periphery. 

— The ISlopes largely determine the structure of the communities. 

— There are indeed some typical thresholds at which the distribution curve decreases sharply 
in most networks. 

— The typical values of ESlopes are more obvious than that of the ISlopes in the citation and 
collaboration networks, in which, the ESlopes of most communities lie in a very narrow 
interval. 

— Communities of the email-euall network have much larger ISlopes and ESlopes in general. 

— The ISlopes and ESlopes of all the communities of the citation and collaboration networks 
approximately follow a normal distribution. 

Fig. 4. Distribution of communities' ISlope. Panels: (a) football, (b) citation, (c) collaboration, (d) email. 

Table III. Statistics of communities. APL represents average path length, D represents 
diameter, CCC represents community clustering coefficient and NCC represents the network 
clustering coefficient. All the results except NCC are calculated by averaging the 
corresponding property over all communities. 

Network        APL    D      CCC       NCC 
football       1.8    3.2    0.6       0.41 
cit_hepth      2.9    7      0.36      0.12 
cit_hepph      2.7    6.7    0.29      0.15 
col_astroph    2.2    4.5    0.71      0.32 
col_condmat    2.7    5.4    0.53      0.26 
col_grqc       2.4    4.7    0.51      0.63 
col_hepth      3.3    7.2    0.39      0.28 
col_hepph      2.8    5.9    0.65      0.66 
email_enron    2.2    4.1    0.39      0.085 
email_euall    2.3    3.5    0.0019    0.0042 

5. MORE GENERAL PROPERTIES 

In the last section, we showed that the internal slope (ISlope) of a community basically 
determines the structure of the community. In this section, we study more general properties 
of the communities. In particular, we consider the average distances, average diameters and 
average clustering coefficients of all the communities in each of the real networks. The 
results are given in Table III. 

Fig. 5. Distribution of communities' ESlope. Surviving panel titles: (c) collaboration, (d) email. 



The distance between two nodes is defined as the number of "hops" in the network 
one needs to move from one given node to another [Newman 2004a]. Usually people are 
interested in the average distance of the whole network [Milgram 1967; Newman 2001b; 
Newman et al. 2001; Travers and Milgram 1969], and these studies show that most real 
networks have very short average distances. In this section, we consider the average 
distance between two nodes within a community, which is the number of "hops" one needs 
to move from one node to another only through members of the same community. 

From Table III we see that the communities of each network have a small average 
distance. In particular, the average distance of the communities of the collaboration 
network hepth is 3.3, which is the largest average community distance among all the 
networks studied in this paper. We also give the average diameter of communities: for 
each of the networks it is between 3.2 and 7.2. This experiment suggests the conjecture 
that there is a three-degree separation property of (true) communities for many real 
networks. The conjecture calls for further investigation, which may provide useful 
information for understanding both true communities and communities found by various 
algorithms. 
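
The within-community distance described above can be illustrated with a short sketch. It assumes an undirected graph stored as a dictionary of adjacency lists, and the function names are hypothetical, not taken from the paper. It also assumes that pairs of members that cannot reach each other inside the community are skipped, which the paper does not specify.

```python
from collections import deque

def bfs_distances(adj, source, members):
    """Hop counts from source to every member reachable through members only."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v in members and v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def community_distance_stats(adj, community):
    """Return (average path length, diameter) of the subgraph induced by community.

    Assumption of this sketch: unreachable pairs are ignored.
    """
    members = set(community)
    total, pairs, diameter = 0, 0, 0
    for s in members:
        for t, d in bfs_distances(adj, s, members).items():
            if t != s:
                total += d
                pairs += 1
                diameter = max(diameter, d)
    apl = total / pairs if pairs else 0.0
    return apl, diameter

# Toy graph: the path 0-1-2-3 is the community; node 4 is an outside shortcut 0-4-3.
adj = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [0, 3]}
apl, diam = community_distance_stats(adj, [0, 1, 2, 3])
```

In the toy graph, nodes 0 and 3 are two hops apart in the whole network but three hops apart inside the community, because the shortcut through node 4 is not allowed.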

Fig. 6. Cumulative distribution of communities' ISlope. Surviving panel titles: (c) collaboration, (d) email. 

The clustering coefficient (or transitivity) is a well studied property of networks 
[Newman 2001a; Newman 2001b; Watts and Strogatz 1998]. It refers to the phenomenon 
that the existence of ties between nodes A and B and between nodes B and C tends to imply 
a tie between A and C. Given a graph G, the clustering coefficient of G is defined by: 

$$C = \frac{3 \times \text{number of triangles on the graph}}{\text{number of connected triples of vertices}} \tag{5}$$
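
Equation (5) can be evaluated directly on a small graph. The following is a minimal sketch, assuming an undirected simple graph given as a symmetric dictionary of adjacency lists in which every node appears as a key. The function name is hypothetical.

```python
from itertools import combinations

def clustering_coefficient(adj):
    """C = 3 * (number of triangles) / (number of connected triples), as in Eq. (5)."""
    neighbors = {u: set(vs) - {u} for u, vs in adj.items()}
    closed = 0   # connected triples whose two end nodes are also linked
    triples = 0  # connected triples (paths of length 2), counted at their center node
    for center, nbrs in neighbors.items():
        k = len(nbrs)
        triples += k * (k - 1) // 2
        for a, b in combinations(nbrs, 2):
            if b in neighbors[a]:
                closed += 1
    # Each triangle closes three triples, so closed equals 3 * (number of triangles).
    return closed / triples if triples else 0.0

triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
triangle_with_tail = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
```

A triangle gives C = 1, a star gives C = 0, and a triangle with one pendant node gives C = 3/5, since it has one triangle and five connected triples.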



From Table III we observe that most communities of the networks have very large 
clustering coefficients, except for those of the email_euall network, and that most small 
communities found by our algorithm have larger clustering coefficients than the 
corresponding original graphs. 

Fig. 7. Cumulative distribution of communities' ESlope 

However, in the collaboration network grqc, the clustering coefficient of the original graph 
is 0.63, but many small communities we found have smaller clustering coefficients. In fact, 
communities with clustering coefficients less than 0.6 make up more than 74% of the 
communities in this network. To explain this phenomenon, we count the triangles in the 
original graph and in its communities. The original graph contains 1,350,014 triangles 
in all. If we divide the communities into two groups, the first consisting of those with 
clustering coefficients larger than 0.6 and the second of the rest, then communities in 
the first group have 3,306 triangles on average, while communities in the second group 
contain 60 triangles on average. If we instead divide the communities at clustering 
coefficient 0.8, the average numbers of triangles in the first and second groups are 
5,027 and 147 respectively. Therefore the triangles are unevenly distributed among 
communities, with a small number of communities containing most of the triangles of the 
network. The high clustering coefficient of the original graph is mainly caused by this 
small group of communities, which contain a much larger number of triangles. 
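
The grouping used in this comparison can be sketched as follows. The sketch assumes that each community has already been summarized as a pair (clustering coefficient, number of triangles). The function name is hypothetical and the toy values are made up for illustration; they are not the paper's data.

```python
def mean_triangles_by_group(stats, threshold):
    """Split communities at a clustering-coefficient threshold and average their triangle counts.

    stats: list of (clustering_coefficient, triangle_count) pairs, one per community.
    Returns (mean for communities above the threshold, mean for the rest).
    """
    high = [t for c, t in stats if c > threshold]
    low = [t for c, t in stats if c <= threshold]

    def mean(xs):
        return sum(xs) / len(xs) if xs else 0.0

    return mean(high), mean(low)

# Made-up toy values, not taken from the paper.
toy = [(0.9, 4000), (0.7, 2000), (0.5, 90), (0.3, 30), (0.2, 60)]
high_06, low_06 = mean_triangles_by_group(toy, 0.6)
high_08, low_08 = mean_triangles_by_group(toy, 0.8)
```

Raising the threshold moves communities from the first group to the second, so both averages can change, as they do between the 0.6 and 0.8 splits reported above.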

From Table III we observe that the clustering coefficients of communities vary among 
different types of networks. Communities in collaboration networks have higher clustering 
coefficients than those in citation and email networks. In the collaboration networks, two 
authors having common collaborators are more likely to collaborate with each other in the 
future. In the citation networks, an author citing a paper tends to cite the references of 
the paper, especially when the references are on the same topic. This explains 
why collaboration networks and citation networks have higher clustering coefficients. 

Email networks have different patterns. Communities in the email_enron network have an 
average clustering coefficient of 0.39, while the original graph has a clustering 
coefficient of only 0.085. In this case, the communities found by our algorithm largely 
amplify the clustering coefficient of the network. This means that although the network 
has a small clustering coefficient, it still contains many local structures showing strong 
internal cohesion. However, communities in the email_euall network have the lowest 
clustering coefficient (only 0.0019). Both the original graph and its communities have 
very small clustering coefficients. In this case, most communities in this network are 
very similar to star-like graphs, which have clustering coefficients near 0. This local 
structure is very different from that of the other networks. 
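
The amplification discussed above can be read off Table III by comparing CCC with NCC for each network. The sketch below copies the two columns from the table. The network names are spelled as we read them from the table, and the ratio CCC/NCC is only an illustrative measure, not one defined in the paper.

```python
# (CCC, NCC) per network, copied from Table III.
table_iii = {
    "football": (0.6, 0.41),
    "cit_hepth": (0.36, 0.12),
    "cit_hepph": (0.29, 0.15),
    "col_astroph": (0.71, 0.32),
    "col_condmat": (0.53, 0.26),
    "col_grqc": (0.51, 0.63),
    "col_hepth": (0.39, 0.28),
    "col_hepph": (0.65, 0.66),
    "email_enron": (0.39, 0.085),
    "email_euall": (0.0019, 0.0042),
}

# Illustrative amplification ratio: average community clustering over network clustering.
ratios = {name: ccc / ncc for name, (ccc, ncc) in table_iii.items()}

# Networks whose communities do not amplify the clustering coefficient on average.
not_amplified = [name for name, r in ratios.items() if r <= 1.0]
```

On these values, email_enron has the largest ratio (about 4.6), while col_grqc, col_hepph and email_euall have average community clustering coefficients below that of the whole network.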

6. CONCLUSIONS 

In this paper, we propose a methodology to characterize and analyze the local structures 
and information of real networks, which includes new notions of internal dominating set, 
external dominating set, internal slope and external slope of a community, and analysis of 
the distributions of internal and external slopes, average distances, diameters, and clustering 
coefficients of all communities for each of the real networks. 

We apply our method to five collaboration networks, two citation 
networks, two email networks and one benchmark network. 

The experiments show that: 1) The notions of internal dominating ratio, external 
dominating ratio, internal slope, external slope and clustering coefficient are essential 
characteristics for understanding the patterns and information of the communities of a real 
network. 2) Different networks have different local structures (or patterns). 3) Most 
communities of a real network have a small internal dominating set and a small external 
dominating set, although the communities themselves may be very large. 4) The small 
dominating set of a community keeps much of the information of the community and, more 
importantly, the information of a community can be extracted from its internal dominating 
set. 5) Both internal and external slopes of all the communities of a network approximately 
follow a normal distribution for most real networks. This means that typical communities of 
the networks have both ISlopes and ESlopes in some small intervals, so that the communities 
have similar patterns. 6) The internal slope (ISlope) of a community basically determines 
the structure of the community. 7) The result that communities have average distances 
less than or equal to 3.3 suggests a general conjecture that there is a three-degree 
separation phenomenon in true communities of most real networks. 8) Normally, communities 
amplify the clustering coefficient of the corresponding network. 9) If a reasonably good 
algorithm fails to find communities that amplify the clustering coefficient of the network, 
then the communities reveal special structures of the network. 

The discoveries above are significant both for understanding the structures of networks 
and for practical applications. Most communities in real networks are not regular or star-like 
graphs; instead, they usually consist of some central nodes surrounded by a periphery, forming a 
core/periphery structure. Such a structure favors the evolution of communities. A small set 
of nodes leads the formation and evolution of the communities. Our results also indicate 
that in real communities a single node rarely takes an absolutely central position as in 
star-like graphs, because such structures are highly unstable. Our analysis 
provides some intuitive pictures of the rich communities of a network. 

To the best of our knowledge, this is the first time that the characteristics and patterns 
of the communities of a real network have been rigorously analyzed and information extracted 
from them, although there is already a huge number of community detection algorithms in the 
literature. The significance of the research is threefold: 1) To understand the local 
structures and connecting patterns of a network. 2) To extract useful information from the 
communities of a network. 3) To help to judge community finding algorithms. 

Our future project (in progress) is to understand the roles of the small internal and 
external dominating sets in the formation and evolution of communities, and to understand 
the mechanisms of the patterns of the communities. 